Foreword



This Technical Specification has been produced by the 3GPP. 

This document specifies the Voice Activity Detector (VAD) to be used in the Discontinuous Transmission (DTX) as
described in [3].

The contents of the present document are subject to continuing work within the TSG and may change following formal 
TSG approval. Should the TSG modify the contents of this TS, it will be re-released by the TSG with an identifying 
change of release date and an increase in version number as follows: 

Version x.y.z 

where: 

x the first digit: 

1 presented to TSG for information; 

2 presented to TSG for approval; 

3 Indicates TSG approved document under change control. 

y the second digit is incremented for all changes of substance, i.e. technical enhancements, corrections, 
updates, etc. 

z the third digit is incremented when editorial only changes have been incorporated in the specification.

1 Scope



This document specifies the Voice Activity Detector (VAD) to be used in the Discontinuous Transmission (DTX) as 
described in [3]. 

The requirements are mandatory on any VAD to be used either in User Equipment (UE) or in Base Station Systems
(BSSs) that utilize the AMR wideband speech codec.



2 Normative References



The following documents contain provisions which, through reference in this text, constitute provisions of the present 
document. 

References are either specific (identified by date of publication, edition number, version number, etc.) or 
non-specific. 

For a specific reference, subsequent revisions do not apply. 

For a non-specific reference, the latest version applies. In the case of a reference to a 3GPP document (including 
a GSM document), a non-specific reference implicitly refers to the latest version of that document in the same 
Release as the present document. 

[1] 3GPP TS 26.173: "ANSI-C code for the Adaptive Multi-Rate Wideband speech codec".

[2] 3GPP TS 26.190: "Speech codec speech processing functions; Adaptive Multi-Rate - Wideband
(AMR-WB) speech codec; Transcoding functions".

[3] 3GPP TS 26.193: "Speech codec speech processing functions; Adaptive Multi-Rate - Wideband
(AMR-WB) speech codec; Source controlled rate operation".

[4] ITU, The International Telecommunications Union, Blue Book, Vol. III, Telephone Transmission
Quality, IXth Plenary Assembly, Melbourne, 14-25 November, 1988, Recommendation G.711,
Pulse code modulation (PCM) of voice frequencies.

[5] 3GPP TR 21.905: "Vocabulary for 3GPP Specifications".



3 Technical Description 

3.1 Definitions, symbols and abbreviations 

3.1.1 Definitions 

For the purposes of the present document, the terms and definitions given in TR 21.905 [5] and the following apply. A
term defined in the present document takes precedence over the definition of the same term, if any, in TR 21.905 [5].

frame: Time interval of 20 ms corresponding to the time segmentation of the speech 
transcoder. 

3.1.2 Symbols 

For the purposes of this TS, the following symbols apply. 

3.1.2.1 Variables 

bckr_est[n]     background noise estimate at the frequency band "n"
burst_count     counts length of a speech burst, used by VAD hangover addition
hang_count      hangover counter, used by VAD hangover addition
level[n]        signal level at the frequency band "n"
new_speech      pointer of the speech encoder; points to a buffer containing the last received samples of a speech frame [2]
noise_level     estimated noise level
pow_sum         input power
s(i)            samples of the input frame
snr_sum         measure between input frame and noise estimate
speech_level    estimated speech level
stat_count      stationarity counter
stat_rat        measure indicating stationarity of the input frame
tone_flag       flag indicating the presence of a tone
vad_thr         VAD threshold
VAD_flag        Boolean VAD flag
vadreg          intermediate VAD decision

3.1.2.2 Constants 

ALPHA_UP1            constant for updating noise estimate (see subclause 3.3.3.2)
ALPHA_DOWN1          constant for updating noise estimate (see subclause 3.3.3.2)
ALPHA_UP2            constant for updating noise estimate (see subclause 3.3.3.2)
ALPHA_DOWN2          constant for updating noise estimate (see subclause 3.3.3.2)
ALPHA3               constant for updating noise estimate (see subclause 3.3.3.2)
ALPHA4               constant for updating average signal level (see subclause 3.3.3.2)
ALPHA5               constant for updating average signal level (see subclause 3.3.3.2)
BURST_HIGH           constant for controlling VAD hangover addition (see subclause 3.3.3.1)
BURST_P1             constant for controlling VAD hangover addition (see subclause 3.3.3.1)
BURST_SLOPE          constant for controlling VAD hangover addition (see subclause 3.3.3.1)
COEFF3               coefficient for the filter bank (see subclause 3.3.1)
COEFF5_1             coefficient for the filter bank (see subclause 3.3.1)
COEFF5_2             coefficient for the filter bank (see subclause 3.3.1)
HANG_HIGH            constant for controlling VAD hangover addition (see subclause 3.3.3.1)
HANG_LOW             constant for controlling VAD hangover addition (see subclause 3.3.3.1)
HANG_P1              constant for controlling VAD hangover addition (see subclause 3.3.3.1)
HANG_SLOPE           constant for controlling VAD hangover addition (see subclause 3.3.3.1)
FRAME_LEN            size of a speech frame, 256 samples (20 ms)
MIN_SPEECH_LEVEL1    constant for speech estimation (see subclause 3.3.3.3)
MIN_SPEECH_LEVEL2    constant for speech estimation (see subclause 3.3.3.3)
MIN_SPEECH_SNR       constant for VAD threshold adaptation (see subclause 3.3.3)
NO_P1                constant for VAD threshold adaptation (see subclause 3.3.3)
NO_SLOPE             constant for VAD threshold adaptation (see subclause 3.3.3)
NOISE_MAX            maximum value for noise estimate (see subclause 3.3.3.2)
NOISE_MIN            minimum value for noise estimate (see subclause 3.3.3.2)
POW_TONE_THR         threshold for tone detection (see subclause 3.3.3)
SP_ACTIVITY_COUNT    constant for speech estimation (see subclause 3.3.3.3)
SP_ALPHA_DOWN        constant for speech estimation (see subclause 3.3.3.3)
SP_ALPHA_UP          constant for speech estimation (see subclause 3.3.3.3)
SP_CH_MAX            constant for VAD threshold adaptation (see subclause 3.3.3)
SP_CH_MIN            constant for VAD threshold adaptation (see subclause 3.3.3)
SP_EST_COUNT         constant for speech estimation (see subclause 3.3.3.3)
SP_P1                constant for VAD threshold adaptation (see subclause 3.3.3)
SP_SLOPE             constant for VAD threshold adaptation (see subclause 3.3.3)
STAT_COUNT           threshold for stationarity detection (see subclause 3.3.3.2)
STAT_THR             threshold for stationarity detection (see subclause 3.3.3.2)
STAT_THR_LEVEL       threshold for stationarity detection (see subclause 3.3.3.2)
THR_HIGH             constant for VAD threshold adaptation (see subclause 3.3.3)
TONE_THR             threshold for tone detection (see subclause 3.3.2)
VAD_POW_LOW          constant for controlling VAD hangover addition (see subclause 3.3.3.1)



3.1.2.3 Functions

+      Addition
-      Subtraction
*      Multiplication
/      Division
|x|    absolute value of x
AND    Boolean AND
OR     Boolean OR

$$\sum_{n=a}^{b} x(n) = x(a) + x(a+1) + \ldots + x(b-1) + x(b)$$

$$\text{MIN}(x, y) = \begin{cases} x, & x \le y \\ y, & \text{otherwise} \end{cases}$$

$$\text{MAX}(x, y) = \begin{cases} x, & x \ge y \\ y, & \text{otherwise} \end{cases}$$



3.1.3 Abbreviations 



For the purposes of the present document, the abbreviations given in TR 21.905 [5] and the following apply. An
abbreviation defined in the present document takes precedence over the definition of the same abbreviation, if any, in
TR 21.905 [5].

ANSI    American National Standards Institute
DTX     Discontinuous Transmission
VAD     Voice Activity Detector
CNG     Comfort Noise Generation



3.2 General 

The function of the VAD algorithm is to indicate whether each 20 ms frame contains signals that should be transmitted, 
e.g. speech, music or information tones. The output of the VAD algorithm is a Boolean flag (VAD_flag) indicating 
presence of such signals. 



3.3 Functional description 



The block diagram of the VAD algorithm is depicted in Figure 1. The VAD algorithm uses parameters of the speech
encoder to compute the Boolean VAD flag (VAD_flag). The input frame for the VAD is sampled at 12.8 kHz and thus
contains 256 samples per 20 ms frame. Samples of the input frame (s(i)) are divided into sub-bands, and the level of
the signal (level[n]) in each band is calculated. The inputs to the tone detection function are the normalized open-loop
pitch gains, which are calculated by the open-loop pitch analysis of the speech encoder. The tone detection function
computes a flag (tone_flag) which indicates the presence of a signalling tone, voiced speech, or other strongly periodic
signal. The background noise level (bckr_est[n]) is estimated in each band based on the VAD decision, signal
stationarity and the tone flag (tone_flag). The intermediate VAD decision is calculated by comparing the input SNR
(level[n]/bckr_est[n]) to an adaptive threshold. The threshold is adapted based on noise and long-term speech
estimates. Finally, the VAD flag is calculated by adding hangover to the intermediate VAD decision.



[Block diagram: s(i) -> Filter bank and computation of sub-band levels -> level[n] -> VAD decision -> VAD_flag;
ol_gain -> Tone detection -> tone_flag -> VAD decision]



Figure 1: Simplified block diagram of the VAD algorithm

3.3.1 Filter bank and computation of sub-band levels

The input signal is divided into frequency bands using a 12-band filter bank (Figure 2). Cut-off frequencies for the filter
bank are shown in Table 1.

Table 1: Cut-off frequencies for the filter bank

Band number    Frequencies
1              0 - 200 Hz
2              200 - 400 Hz
3              400 - 600 Hz
4              600 - 800 Hz
5              800 - 1200 Hz
6              1200 - 1600 Hz
7              1600 - 2000 Hz
8              2000 - 2400 Hz
9              2400 - 3200 Hz
10             3200 - 4000 Hz
11             4000 - 4800 Hz
12             4800 - 6400 Hz



Input for the filter bank is a speech frame pointed to by the new_speech pointer of the speech encoder [1]. Input values
for the filter bank are scaled down by one bit. This ensures safe scaling, i.e. saturation cannot occur during calculation
of the filter bank.



[Diagram: cascade of 5th order and 3rd order filter blocks producing the twelve sub-band outputs 0.0 - 0.2 kHz,
0.2 - 0.4 kHz, 0.4 - 0.6 kHz, 0.6 - 0.8 kHz, 0.8 - 1.2 kHz, 1.2 - 1.6 kHz, 1.6 - 2.0 kHz, 2.0 - 2.4 kHz, 2.4 - 3.2 kHz,
3.2 - 4.0 kHz, 4.0 - 4.8 kHz and 4.8 - 6.4 kHz]



Figure 2: Filter bank 

The filter bank consists of 5th and 3rd order filter blocks. Each filter block divides the input into high-pass and low-pass
parts and decimates the sampling frequency by 2. The 5th order filter block is calculated as follows:

$$x_{lp}(i) = 0.5 \cdot \left(A_1(x(2i)) + A_2(x(2i+1))\right) \tag{1a}$$

$$x_{hp}(i) = 0.5 \cdot \left(A_1(x(2i)) - A_2(x(2i+1))\right) \tag{1b}$$

where

x(i)        input signal for a filter block
x_lp(i)     low-pass component
x_hp(i)     high-pass component

The 3rd order filter block is calculated as follows:

$$x_{lp}(i) = 0.5 \cdot \left(x(2i+1) + A_3(x(2i))\right) \tag{2a}$$

$$x_{hp}(i) = 0.5 \cdot \left(x(2i+1) - A_3(x(2i))\right) \tag{2b}$$

The filters A_1(), A_2(), and A_3() are first order direct form all-pass filters, whose transfer function is given by:

$$A(z) = \frac{C + z^{-1}}{1 + C \cdot z^{-1}} \tag{3}$$

where C is the filter coefficient.

Coefficients for the all-pass filters A_1(), A_2(), and A_3() are COEFF5_1, COEFF5_2, and COEFF3, respectively.
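
The filter blocks of equations (1a) to (3) can be sketched in floating point as follows. This is an illustrative sketch only, not the bit-exact fixed-point code of [1]: the function names are hypothetical, the coefficient values must be supplied by the caller, and the all-pass state is reset on every call, whereas a frame-by-frame implementation would carry it over between frames.

```python
def allpass(x, c, state=(0.0, 0.0)):
    """First-order all-pass A(z) = (c + z^-1) / (1 + c*z^-1).

    state is (previous input, previous output); returns (outputs, new state).
    """
    x_prev, y_prev = state
    out = []
    for sample in x:
        y = c * sample + x_prev - c * y_prev
        out.append(y)
        x_prev, y_prev = sample, y
    return out, (x_prev, y_prev)


def filter5(x, coeff5_1, coeff5_2):
    """5th order block, equations (1a)/(1b); x must have even length."""
    a1, _ = allpass(x[0::2], coeff5_1)  # A_1 applied to even samples x(2i)
    a2, _ = allpass(x[1::2], coeff5_2)  # A_2 applied to odd samples x(2i+1)
    lp = [0.5 * (p + q) for p, q in zip(a1, a2)]
    hp = [0.5 * (p - q) for p, q in zip(a1, a2)]
    return lp, hp


def filter3(x, coeff3):
    """3rd order block, equations (2a)/(2b); x must have even length."""
    a3, _ = allpass(x[0::2], coeff3)    # A_3 applied to even samples x(2i)
    odd = x[1::2]                       # odd samples x(2i+1) pass unfiltered
    lp = [0.5 * (o + a) for o, a in zip(odd, a3)]
    hp = [0.5 * (o - a) for o, a in zip(odd, a3)]
    return lp, hp
```

Each call halves the number of samples, which is the decimation by 2 described above; with C = 0 the all-pass reduces to a one-sample delay.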

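Signal level is calculated at the output of the filter bank at each frequency band as follows:

$$\text{level}(n) = \sum_{i=\text{START}_n}^{\text{END}_n} \left| x_n(i) \right| \tag{4}$$

where:

n           index for the frequency band
x_n(i)      sample i at the output of the filter bank at frequency band n

$$\text{START}_n = \begin{cases} -6, & 1 \le n \le 4 \\ -12, & 5 \le n \le 8 \\ -24, & 9 \le n \le 11 \\ -48, & n = 12 \end{cases}
\qquad
\text{END}_n = \begin{cases} 7, & 1 \le n \le 4 \\ 15, & 5 \le n \le 8 \\ 31, & 9 \le n \le 11 \\ 63, & n = 12 \end{cases}$$

Negative indices of x_n(i) refer to the previous frame.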
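
A minimal sketch of equation (4) is given below. The function names are hypothetical, and it is assumed that index -1 denotes the last sub-band sample of the previous frame, so that each level spans the tail of the previous frame plus the whole current frame.

```python
def band_limits(n):
    """Return (START_n, END_n) of equation (4) for band n in 1..12."""
    if 1 <= n <= 4:
        return -6, 7
    if 5 <= n <= 8:
        return -12, 15
    if 9 <= n <= 11:
        return -24, 31
    if n == 12:
        return -48, 63
    raise ValueError("band index must be in 1..12")


def band_level(n, current, previous):
    """Sum |x_n(i)| for i = START_n..END_n; negative i indexes the previous frame."""
    start, end = band_limits(n)
    if len(current) != end + 1 or len(previous) != end + 1:
        raise ValueError("unexpected sub-band frame length")
    tail = previous[start:]  # last -START_n samples of the previous frame
    return sum(abs(v) for v in tail) + sum(abs(v) for v in current)
```

For example, band 1 holds 8 samples per frame after decimation, so its level covers 6 samples of the previous frame and 8 of the current one.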

3.3.2 Tone detection 

The purpose of the tone detection function is to detect information tones, vowel sounds and other periodic signals. The 
tone detection uses normalized open-loop pitch gains (ol_gain), which are received from the speech encoder. If the pitch 
gain is higher than the constant TONE_THR, tone is detected and the tone flag is set:

if (ol_gain > TONE_THR)
    tone_flag = 1

The open-loop pitch search and correspondingly the tone flag is computed twice in each frame, except for mode 6.60 
kbit/s, where it is computed only once. 

3.3.3 VAD decision 

The block diagram of the VAD decision algorithm is shown in Figure 3.

[Block diagram with blocks SNR Computation, Background Noise Estimation, Comparison, Hangover Addition, Speech
Estimation and Threshold Adaptation; signals level[n], tone_flag, bckr_est[n], vad_thr, vadreg, noise_level,
speech_level; output VAD_flag]



Figure 3: Simplified block diagram of the VAD decision algorithm 

Power of the input frame is calculated as follows:

$$\text{frame\_pow} = \sum_{i=0}^{\text{FRAME\_LEN}-1} s(i) \cdot s(i) \tag{5}$$

where samples s(i) of the input frame are pointed to by the new_speech pointer of the speech encoder. Variable pow_sum
is the sum of the powers of the current and previous frames. If pow_sum is lower than the constant POW_TONE_THR,
tone_flag is set to zero.
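
The power computation and the low-power reset of the tone flag can be sketched as follows. The function names and the way the previous frame power is passed in are assumptions made for illustration; the threshold value is supplied by the caller.

```python
def frame_power(samples):
    """Equation (5): sum of squared samples of one input frame."""
    return sum(s * s for s in samples)


def update_pow_sum(frame_pow, prev_frame_pow, tone_flag, pow_tone_thr):
    """Return (pow_sum, tone_flag); the tone flag is cleared for low-power input."""
    pow_sum = frame_pow + prev_frame_pow
    if pow_sum < pow_tone_thr:
        tone_flag = 0
    return pow_sum, tone_flag
```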

The difference between the signal levels of the input frame and the background noise estimate is calculated as follows:

$$\text{snr\_sum} = \sum_{n=1}^{12} \text{MAX}\left(1.0, \frac{\text{level}[n]}{\text{bckr\_est}[n]}\right)^2 \tag{6}$$

where:

level[n]        signal level at band n
bckr_est[n]     level of background noise estimate at band n

VAD decision is made by comparing the variable snr_sum to a threshold. The threshold (vad_thr) is adapted to get
desired sensitivity depending on estimated speech and background noise levels.
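
As an illustration, snr_sum of equation (6) can be computed as below. The sketch assumes that each per-band ratio is floored at 1.0 and then squared before summation, and that bckr_est[n] is strictly positive (it is limited from below by NOISE_MIN); the function name is hypothetical.

```python
def snr_sum(level, bckr_est):
    """Sum over all bands of MAX(1.0, level[n] / bckr_est[n]) squared."""
    return sum(max(1.0, lev / est) ** 2 for lev, est in zip(level, bckr_est))
```

With this definition, a frame whose level equals the noise estimate in all 12 bands yields snr_sum = 12.0, the smallest value the measure can take.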

Average background noise level is calculated by adding noise estimates at each band except the lowest band:

$$\text{noise\_level} = \sum_{n=2}^{12} \text{bckr\_est}[n] \tag{7}$$

If the SNR is lower than the threshold (MIN_SPEECH_SNR), speech level is increased as follows:

if (speech_level / noise_level < MIN_SPEECH_SNR)
    speech_level = MIN_SPEECH_SNR * noise_level

Logarithmic value for noise estimate is calculated as follows:

$$\text{ilog2\_noise\_level} = \log_2(\text{noise\_level}) \tag{8}$$

Before logarithmic value from the speech estimate is calculated, MIN_SPEECH_SNR*noise_level is subtracted from
the speech level to correct its value in low SNR situations.

$$\text{ilog2\_speech\_level} = \log_2(\text{speech\_level} - \text{MIN\_SPEECH\_SNR} \cdot \text{noise\_level}) \tag{9}$$

Threshold for VAD decision is calculated as follows:

vad_thr = NO_SLOPE * (ilog2_noise_level - NO_P1) + THR_HIGH + MIN(SP_CH_MAX,
MAX(SP_CH_MIN, SP_CH_MIN + SP_SLOPE * (ilog2_speech_level - SP_P1))), (10)

where NO_SLOPE, SP_SLOPE, NO_P1, SP_P1, THR_HIGH, SP_CH_MAX and SP_CH_MIN are constants.

The variable vadreg indicates intermediate VAD decision and it is calculated as follows:

if (snr_sum > vad_thr)
    vadreg = 1
else
    vadreg = 0
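
The threshold adaptation of equations (8) to (10) and the comparison above can be sketched in floating point as follows. The function names and the dictionary k of constants are hypothetical; constant values are not given in this document and must be taken from [1]. The floor applied before the second logarithm is an assumption of this sketch, added so that the logarithm stays defined when the corrected speech level is zero.

```python
import math


def vad_threshold(speech_level, noise_level, k):
    """Adaptive threshold vad_thr; k maps constant names to values, noise_level > 0."""
    # Raise the speech level when the estimated SNR is below MIN_SPEECH_SNR.
    if speech_level / noise_level < k["MIN_SPEECH_SNR"]:
        speech_level = k["MIN_SPEECH_SNR"] * noise_level
    ilog2_noise_level = math.log2(noise_level)                      # equation (8)
    # Assumption: floor the argument so log2 is defined when the excess is zero.
    excess = max(speech_level - k["MIN_SPEECH_SNR"] * noise_level, 1e-9)
    ilog2_speech_level = math.log2(excess)                          # equation (9)
    speech_term = k["SP_CH_MIN"] + k["SP_SLOPE"] * (ilog2_speech_level - k["SP_P1"])
    speech_term = min(k["SP_CH_MAX"], max(k["SP_CH_MIN"], speech_term))
    # Equation (10): noise-dependent part plus the limited speech-dependent part.
    return k["NO_SLOPE"] * (ilog2_noise_level - k["NO_P1"]) + k["THR_HIGH"] + speech_term


def intermediate_decision(snr_sum, vad_thr):
    """Intermediate VAD decision vadreg."""
    return 1 if snr_sum > vad_thr else 0
```

The MIN/MAX pair confines the speech-dependent correction to the interval [SP_CH_MIN, SP_CH_MAX], so the speech estimate can only shift the threshold within a bounded range.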

3.3.3.1 Hangover addition 

Before the final VAD flag is given, a hangover is added. The hangover addition helps to detect low power endings of
speech bursts, which are subjectively important but difficult to detect.

VAD flag is set to '1' if fewer than hang_len frames with '0' decision have elapsed since burst_len consecutive '1'
decisions have been detected. The variables hang_len and burst_len are computed using vad_thr as follows:

hang_len = MAX(HANG_LOW, (HANG_SLOPE * (vad_thr - HANG_P1) + HANG_HIGH)) (11)

burst_len = BURST_SLOPE * (vad_thr - BURST_P1) + BURST_HIGH (12)

The power of the input frame is compared to a threshold (VAD_POW_LOW). If the power is lower, the VAD flag is set
to '0' and no hangover is added. The VAD_flag is calculated as follows:

VAD_flag = 0
if (pow_sum < VAD_POW_LOW)
    burst_count = 0
    hang_count = 0
else
    if (vadreg = 1)
        burst_count = burst_count + 1
        if (burst_count >= burst_len)
            hang_count = hang_len
        VAD_flag = 1
    else
        burst_count = 0
        if (hang_count > 0)
            hang_count = hang_count - 1
            VAD_flag = 1
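
A minimal sketch of the hangover addition, following equations (11) and (12) and the pseudocode above, is shown below. The class and function names and the dictionary k of constants are hypothetical, and the counters are assumed to be reset to zero in the low-power and non-speech branches.

```python
def hangover_lengths(vad_thr, k):
    """Return (hang_len, burst_len) from equations (11) and (12)."""
    hang_len = max(k["HANG_LOW"],
                   k["HANG_SLOPE"] * (vad_thr - k["HANG_P1"]) + k["HANG_HIGH"])
    burst_len = k["BURST_SLOPE"] * (vad_thr - k["BURST_P1"]) + k["BURST_HIGH"]
    return hang_len, burst_len


class HangoverState:
    """Holds burst_count and hang_count between frames."""

    def __init__(self):
        self.burst_count = 0
        self.hang_count = 0

    def update(self, vadreg, pow_sum, hang_len, burst_len, vad_pow_low):
        """Return VAD_flag for the current frame."""
        if pow_sum < vad_pow_low:
            # Very low input power: no speech and no hangover.
            self.burst_count = 0
            self.hang_count = 0
            return 0
        if vadreg == 1:
            self.burst_count += 1
            if self.burst_count >= burst_len:
                self.hang_count = hang_len   # burst long enough: arm the hangover
            return 1
        self.burst_count = 0
        if self.hang_count > 0:
            self.hang_count -= 1             # hangover frame
            return 1
        return 0
```

With hang_len = 2 and burst_len = 3, three consecutive speech frames followed by silence give the flag sequence 1, 1, 1, 1, 1, 0: the two extra '1' decisions are the hangover.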

3.3.3.2 Background noise estimation 

Background noise estimate (bckr_est[n]) is updated using amplitude levels of the previous frame. Thus, the update is
delayed by one frame to prevent undetected starts of speech bursts from corrupting the noise estimate. The update speed
for the current frame is selected using intermediate VAD decisions (vadreg) and stationarity counter (stat_count) as
follows:

if (vadreg for the last 4 frames has been zero)
    alpha_up = ALPHA_UP1
    alpha_down = ALPHA_DOWN1
else if (stat_count = 0)
    alpha_up = ALPHA_UP2
    alpha_down = ALPHA_DOWN2
else
    alpha_up = 0
    alpha_down = ALPHA3

The variable stat_count indicates stationarity and its purpose is explained later in this subclause. The variables alpha_up
and alpha_down define the update speed for upwards and downwards, respectively. The update speed for each band "n"
is selected as follows:

if (bckr_est_m[n] < level_{m-1}[n])
    alpha[n] = alpha_up
else
    alpha[n] = alpha_down

Finally, noise estimate is updated as follows:

$$\text{bckr\_est}_{m+1}[n] = (1.0 - \text{alpha}[n]) \cdot \text{bckr\_est}_m[n] + \text{alpha}[n] \cdot \text{level}_{m-1}[n] \tag{13}$$

where: 

n index of the frequency band 

m index of the frame 

Level of the background estimate (bckr_est[n]) is limited between constants NOISE_MIN and NOISE_MAX. 
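
The selection of the update speed and the update of equation (13) can be sketched as follows. The function names and the dictionary k of constants are hypothetical, the upward speed in the last branch is assumed to be zero (no upward update while speech is detected and the signal is not judged stationary), and the limiting to NOISE_MIN and NOISE_MAX is applied after each update.

```python
def select_update_speed(last4_vadreg_zero, stat_count, k):
    """Return (alpha_up, alpha_down) for the current frame."""
    if last4_vadreg_zero:
        return k["ALPHA_UP1"], k["ALPHA_DOWN1"]
    if stat_count == 0:
        return k["ALPHA_UP2"], k["ALPHA_DOWN2"]
    return 0.0, k["ALPHA3"]


def update_noise(bckr_est, prev_level, alpha_up, alpha_down, noise_min, noise_max):
    """Equation (13) per band, using the levels of the previous frame."""
    updated = []
    for est, lev in zip(bckr_est, prev_level):
        alpha = alpha_up if est < lev else alpha_down
        new_est = (1.0 - alpha) * est + alpha * lev
        updated.append(min(noise_max, max(noise_min, new_est)))
    return updated
```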

If the level of background noise increases suddenly, vadreg will be set to "1" and the background noise estimate is not
normally updated upwards. To recover from this situation, update of the background noise estimate is enabled if the
intermediate VAD decision (vadreg) is "1" for a long enough time and the spectrum is stationary. Stationarity (stat_rat)
is estimated using the following equation:

$$\text{stat\_rat} = \sum_{n=1}^{12} \frac{\text{MAX}(\text{STAT\_THR\_LEVEL}, \text{MAX}(\text{ave\_level}_m[n], \text{level}_m[n]))}{\text{MAX}(\text{STAT\_THR\_LEVEL}, \text{MIN}(\text{ave\_level}_m[n], \text{level}_m[n]))} \tag{14}$$

where:

STAT_THR_LEVEL    a constant
n                 index of the frequency band
m                 index of the frame
ave_level         average level of the input signal

If the stationarity estimate (stat_rat) is higher than a threshold, the stationarity counter (stat_count) is set to the
initial value defined by constant STAT_COUNT. If the signal is not stationary but speech has been detected (VAD
decision is "1"), stat_count is decreased by one in each frame until it is zero.

if (5 last tone flags have been one)
    stat_count = STAT_COUNT
else
    if (8 last internal VAD decisions have been zero) OR (stat_rat > STAT_THR)
        stat_count = STAT_COUNT
    else
        if (vadreg) AND (stat_count != 0)
            stat_count = stat_count - 1
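
The stationarity measure of equation (14) and the counter logic above can be sketched as below. The function names and the Boolean history arguments are hypothetical, the sum is assumed to run over all bands passed in, and STAT_THR_LEVEL is assumed to be positive so that the denominator never vanishes.

```python
def stat_rat(ave_level, level, stat_thr_level):
    """Equation (14): per-band max/min ratio, floored at STAT_THR_LEVEL and summed."""
    total = 0.0
    for ave, lev in zip(ave_level, level):
        num = max(stat_thr_level, max(ave, lev))
        den = max(stat_thr_level, min(ave, lev))
        total += num / den
    return total


def update_stat_count(stat_count, ratio, vadreg, tones_last5_all_one,
                      vad_last8_all_zero, stat_thr, stat_count_init):
    """Return the new stat_count following the pseudocode above."""
    if tones_last5_all_one:
        return stat_count_init
    if vad_last8_all_zero or ratio > stat_thr:
        return stat_count_init
    if vadreg and stat_count != 0:
        return stat_count - 1
    return stat_count
```

Each band contributes at least 1.0 to stat_rat, and a band contributes exactly 1.0 when its current level matches its average level or when both lie below STAT_THR_LEVEL.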

The average signal levels (ave_level[n]) are calculated as follows:

$$\text{ave\_level}_{m+1}[n] = (1.0 - \text{alpha}) \cdot \text{ave\_level}_m[n] + \text{alpha} \cdot \text{level}_m[n] \tag{15}$$

The update speed (alpha) for the previous equation is selected as follows:

if (stat_count = STAT_COUNT)
    alpha = 1.0
else if (vadreg = 1)
    alpha = ALPHA5
else
    alpha = ALPHA4
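
Equation (15) together with the selection of alpha can be sketched as follows; the function name and argument order are hypothetical and the constant values are supplied by the caller.

```python
def update_ave_level(ave_level, level, stat_count, vadreg,
                     stat_count_init, alpha4, alpha5):
    """Return the updated average levels for all bands."""
    if stat_count == stat_count_init:
        alpha = 1.0      # counter just reset: average follows the current level
    elif vadreg == 1:
        alpha = alpha5
    else:
        alpha = alpha4
    return [(1.0 - alpha) * ave + alpha * lev for ave, lev in zip(ave_level, level)]
```

With alpha = 1.0 the average is replaced by the current level, so the ratio in equation (14) restarts from 1.0 per band in the following frame.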

3.3.3.3 Speech level estimation 

First, full-band input level is calculated by summing input levels in each band except the lowest band as follows:

$$\text{in\_level} = \sum_{n=2}^{12} \text{level}[n] \tag{16}$$

A frame is assumed to contain speech if its level is high enough (MIN_SPEECH_LEVEL1), and the intermediate VAD
flag (vadreg) is set or the input level is higher than the current speech level estimate. Maximum level (sp_max) from
SP_EST_COUNT frames is searched. If SP_ACTIVITY_COUNT speech frames are found within SP_EST_COUNT
frames, the speech level estimate is updated by the maximum signal level (sp_max). The
pseudocode for the speech level estimation is as follows:

if (SP_ACTIVITY_COUNT > SP_EST_COUNT - sp_est_cnt + sp_max_cnt)
    sp_est_cnt = 0
    sp_max_cnt = 0
    sp_max = 0
sp_est_cnt = sp_est_cnt + 1

if (in_level > MIN_SPEECH_LEVEL1) AND ((vadreg = 1) OR (in_level > speech_level))
    sp_max_cnt = sp_max_cnt + 1
    sp_max = MAX(sp_max, in_level)
    if (sp_max_cnt > SP_ACTIVITY_COUNT)
        if (sp_max > MIN_SPEECH_LEVEL2)
            if (sp_max > speech_level)
                speech_level = speech_level + SP_ALPHA_UP * (sp_max - speech_level)
            else
                speech_level = speech_level + SP_ALPHA_DOWN * (sp_max - speech_level)
        sp_max_cnt = 0
        sp_max = 0
        sp_est_cnt = 0
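
The speech level estimation can be sketched as a small state machine, shown below. The class name and the dictionary k of constants are hypothetical, the counters are assumed to be reset to zero, and the final reset is assumed to belong to the branch taken when sp_max_cnt exceeds SP_ACTIVITY_COUNT.

```python
class SpeechEstimator:
    """Tracks speech_level together with sp_est_cnt, sp_max_cnt and sp_max."""

    def __init__(self, speech_level):
        self.speech_level = speech_level
        self.sp_est_cnt = 0
        self.sp_max_cnt = 0
        self.sp_max = 0.0

    def update(self, in_level, vadreg, k):
        """Process one frame and return the current speech level estimate."""
        # Restart the search if the activity count can no longer be reached.
        if k["SP_ACTIVITY_COUNT"] > k["SP_EST_COUNT"] - self.sp_est_cnt + self.sp_max_cnt:
            self.sp_est_cnt = 0
            self.sp_max_cnt = 0
            self.sp_max = 0.0
        self.sp_est_cnt += 1
        is_speech = in_level > k["MIN_SPEECH_LEVEL1"] and (
            vadreg == 1 or in_level > self.speech_level)
        if is_speech:
            self.sp_max_cnt += 1
            self.sp_max = max(self.sp_max, in_level)
            if self.sp_max_cnt > k["SP_ACTIVITY_COUNT"]:
                if self.sp_max > k["MIN_SPEECH_LEVEL2"]:
                    if self.sp_max > self.speech_level:
                        alpha = k["SP_ALPHA_UP"]
                    else:
                        alpha = k["SP_ALPHA_DOWN"]
                    self.speech_level += alpha * (self.sp_max - self.speech_level)
                # Clear all counters used for the estimation.
                self.sp_max_cnt = 0
                self.sp_max = 0.0
                self.sp_est_cnt = 0
        return self.speech_level
```

The input in_level of equation (16) can be obtained as sum(level[1:]) when level is a 12-element list ordered from band 1 to band 12.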

4 Computational details 

A low-level description has been prepared in the form of ANSI C-code [1].